Apply Databricks Labs Repository Lockdown policy#19

Merged
gueniai merged 7 commits into main from security/repo-lockdown on Apr 30, 2026

@mjohns-databricks
Collaborator

Summary

Applies the Databricks Labs Repository Lockdown policy to GeoBrix ahead of the 2026-03-10 SHA-pinning cutoff. Scope is lockdown items 1, 3–6 (item 2, Hatch→uv, is N/A — GeoBrix is Scala/Maven + setuptools Python with no Hatch).

Five commits on top of main:

  1. 3ff670a Add scripts/security/ action-pinning tooling (list-external-actions, resolve-action-ref, pin-gh-actions, README).
  2. 514871b Pin external GitHub Actions to commit SHAs (cutoff 2026-03-10). Every uses: org/repo@<tag> in .github/workflows/ and .github/actions/ is rewritten to @<sha> # <tag>. Local first-party uses: ./.github/actions/* refs are intentionally unchanged. Tooling is rerunnable.
  3. 0fab757 Permissions + environment: runtime hardening. Top-level contents: read added where missing; stray top-level id-token: write removed from jobs that never request OIDC. Every job using REPO_ACCESS_TOKEN (the only non-exempt secret in use) now runs in the single protected environment runtime. deploy-docs drops pages: write / id-token: write from top level — moved to the deploy job only. release.yml's environment: release renamed → runtime. release.yml and publish-maven.yml disabled via if: false with banner comments and re-enable instructions (we are not publishing to PyPI / GitHub Packages from Actions today).
  4. 7076d47 Dependabot: cooldown.default-days: 7 on maven and pip ecosystems; github-actions ecosystem intentionally absent (SHAs are refreshed manually via scripts/security/pin-gh-actions), documented in a comment.
  5. 6bd5a0b Dockerfile + install_hadoop.sh hardening. FROM ubuntu:24.04 pinned by multi-arch manifest-list digest sha256:c4a8d5503dfb…41c7b. Hadoop 3.4.0 (pinned SHA-512 from downloads.apache.org), GDAL 3.11.4 (pinned SHA-256; upstream only ships MD5, so we MD5-verified the tarball then computed SHA-256 locally), and Maven 3.9.9 (pinned SHA-512; previously did a dynamic .sha512 fetch from the same origin as the tarball → no protection against origin compromise). scripts/util/install_hadoop.sh (unreferenced manual helper) hardened with set -euo pipefail + matching SHA-512 verification.

Policy items — coverage map

| Policy item | Status | Notes |
|---|---|---|
| 1. SHA-pin Actions before 2026-03-10 | | Commit 2 + tooling in commit 1 |
| 2. Hatch → uv migration | N/A | No Hatch in this repo |
| 3. Checksum-verify downloads | | Commit 5 |
| 4. Docker image digest pin | | Commit 5 |
| 5. Workflow permissions + protected env | | Commit 3 |
| 6. Dependabot cooldown + ecosystem hygiene | | Commit 4 |

Reviewer notes — breaking-ish changes to double-check

Some Actions were pinned at a newer major than the tag the repo was previously using (commit 514871b):

  • actions/checkout v5 → v6 (SHA de0fac2e…)
  • actions/upload-artifact v5 → v7
  • actions/download-artifact v5 → v8
  • actions/setup-node v4 → v6
  • actions/setup-python v5 → v6
  • actions/upload-pages-artifact v3 → v4

The repo's workflows still accept node20 runtime and the public API shapes are unchanged, but please confirm with a green CI run.

Operational prerequisites on the repo

Before merging:

  • Create the GitHub Environment named exactly runtime (Settings → Environments → New environment). No reviewers/wait-timer required initially — the environment binding itself is the gate for REPO_ACCESS_TOKEN scoping.
  • Move REPO_ACCESS_TOKEN from repo-level secrets to the runtime environment's secrets so it can only be read by jobs that bind to it.
  • Confirm CODECOV_TOKEN stays at the repo/org level (exempt secret — no environment needed).

Test plan

  • Confirm runtime environment exists and REPO_ACCESS_TOKEN is scoped to it
  • Green build main run (PR trigger path hits update-doc-inventory + build, both gated by environment: runtime)
  • Verify deploy-docs preview run still builds (doesn't deploy on PRs)
  • gbx:test:scala + gbx:test:python pass in Docker (no behavior change expected, but Dockerfile was rewritten around the Hadoop/GDAL/Maven fetch sections)
  • scripts/security/list-external-actions returns an empty problem list (every external ref is a SHA with a tag comment)
  • Dependabot PRs (when they land) honor the 7-day cooldown
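
The "empty problem list" check in the test plan can be approximated with a short Python sketch. This is not the actual scripts/security/list-external-actions implementation (which the PR says relies on yq); the regular expression, the function name `unpinned_refs`, and the rule "40 hex characters plus a non-empty trailing comment" are assumptions made for illustration.

```python
import re

# Assumed shape of an external reference line, e.g.
#   - uses: owner/repo@<ref>  # <tag>
# Local refs such as ./.github/actions/foo carry no "@<ref>" and do not match.
USES_RE = re.compile(
    r"^\s*(?:-\s*)?uses:\s*(?P<action>[^@\s#]+)@(?P<ref>[^\s#]+)\s*(?:#(?P<comment>.*))?$"
)
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_refs(workflow_text):
    """Return (line_number, line) pairs for external refs lacking a SHA or a tag comment."""
    problems = []
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if match is None:
            continue  # not an external `uses:` line
        has_sha = FULL_SHA_RE.match(match.group("ref")) is not None
        has_tag_comment = bool((match.group("comment") or "").strip())
        if not (has_sha and has_tag_comment):
            problems.append((number, line))
    return problems
```

An empty return value corresponds to the state the test plan asks for: every external ref is a full commit SHA followed by a tag comment.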

This pull request and its description were written by Isaac.

Implements the three-script workflow from the Databricks Labs Repository
Lockdown policy: list-external-actions -> resolve-action-ref -> pin-gh-actions.

- list-external-actions: emits every third-party action referenced under
  .github/ (requires yq by Mike Farah).
- resolve-action-ref: for each action, finds the most recent release tag
  published before the cutoff (2026-03-10T00:00:00Z) and resolves it to a
  commit SHA. Handles both mono-repo conventions: subpath-prefixed tags
  (databrickslabs/sandbox/acceptance -> acceptance/v0.4.4) and top-level
  shared tags (github/codeql-action/analyze -> v4.32.6, where the subpath
  is just a directory inside a repo using a unified tag series).
- pin-gh-actions: consumes resolve-action-ref output, rewrites every
  matching `uses:` under .github/ with the SHA form + tag comment, and
  stages (but does not commit) the result. Skips databricks/databrickslabs
  actions per policy. Deviates from the blueprint reference in one way:
  does not auto-create or switch branches, because GeoBrix manages
  branches manually.

README documents the typical flow and the 2026-03-10 cutoff.

Co-authored-by: Isaac
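
The rewrite step described for pin-gh-actions can be sketched in Python as follows. This is a hedged illustration, not the script itself: the function name `pin_line`, the `resolutions` mapping (action path to a `(sha, tag)` pair), and the all-zero placeholder SHA in the usage note are hypothetical.

```python
import re

USES_RE = re.compile(
    r"^(?P<prefix>\s*(?:-\s*)?uses:\s*)(?P<action>[^@\s#]+)@(?P<ref>[^\s#]+).*$"
)
# Owners skipped per policy, as stated in the commit message.
SKIPPED_OWNERS = {"databricks", "databrickslabs"}


def pin_line(line, resolutions):
    """Rewrite one `uses:` line to `@<sha> # <tag>` when a resolution exists."""
    match = USES_RE.match(line)
    if match is None:
        return line  # local ./ refs and non-`uses:` lines pass through untouched
    action = match.group("action")
    owner = action.split("/", 1)[0]
    if owner in SKIPPED_OWNERS or action not in resolutions:
        return line
    sha, tag = resolutions[action]
    return f"{match.group('prefix')}{action}@{sha} # {tag}"
```

For example, with `resolutions = {"actions/checkout": ("0" * 40, "v6.0.2")}` (placeholder SHA), the line `- uses: actions/checkout@v5` becomes `- uses: actions/checkout@000…000 # v6.0.2`, while a `./.github/actions/*` line is returned unchanged.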
Every third-party `uses:` under .github/workflows/ and .github/actions/ is
now pinned to the commit SHA of the most recent release published before
2026-03-10T00:00:00Z, with the release tag preserved as an inline comment
for cross-reference (the comment is informational only — reviewers must
re-verify the SHA against the upstream release). Generated by running:

  ./scripts/security/list-external-actions \
    | xargs ./scripts/security/resolve-action-ref \
    | ./scripts/security/pin-gh-actions

Resolutions (all 15 external refs, ordered; every ref was on a mutable
tag prior to this change):

  actions/cache@v4, v5            -> cdf6c1fa...  # v5.0.3
  actions/checkout@v5             -> de0fac2e...  # v6.0.2   (major bump)
  actions/deploy-pages@v4         -> d6db9016...  # v4.0.5
  actions/download-artifact@v5    -> 70fc10c6...  # v8.0.0   (major bump)
  actions/setup-java@v5           -> be666c2f...  # v5.2.0
  actions/setup-node@v4           -> 53b83947...  # v6.3.0   (major bump)
  actions/setup-python@v5         -> a309ff8b...  # v6.2.0   (major bump)
  actions/upload-artifact@v5      -> bbbca2dd...  # v7.0.0   (major bump)
  actions/upload-pages-artifact@v3 -> 7b1f4a76...  # v4.0.0   (major bump)
  codecov/codecov-action@v5       -> 671740ac...  # v5.5.2
  github/codeql-action/*@v4       -> 0d579ffd...  # v4.32.6
  pypa/gh-action-pypi-publish@... -> ed0c5393...  # v1.13.0
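
The selection rule behind these resolutions ("most recent release published before the cutoff") can be sketched in Python. This is an assumption-laden illustration rather than the resolve-action-ref implementation: the function name, the `(tag, published_at)` input shape, and any release dates fed to it are hypothetical. Only the cutoff timestamp comes from the commit message.

```python
from datetime import datetime, timezone

# 2026-03-10T00:00:00Z, as stated in the commit message.
CUTOFF = datetime(2026, 3, 10, tzinfo=timezone.utc)


def latest_before_cutoff(releases, cutoff=CUTOFF):
    """Return the tag of the newest release published strictly before the cutoff.

    `releases` is an iterable of (tag, published_at) pairs with timezone-aware
    datetimes. Returns None when no release qualifies.
    """
    eligible = [(published, tag) for tag, published in releases if published < cutoff]
    if not eligible:
        return None
    return max(eligible)[1]
```

Because the rule looks only at publication time, it can select a newer major version than the tag a workflow used previously.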

Major-version jumps are consistent with the policy ("latest release before
the cutoff") but carry breaking-change risk — reviewers should validate
each bump against the action's CHANGELOG before merge. In particular,
upload-artifact v4+ and download-artifact v4+ changed artifact immutability
semantics; the new versions may interact with the existing upload_artifacts
composite action in ways worth exercising under CI before unblocking.

Local composite action refs (./.github/actions/*) are unaffected —
they're first-party.

Co-authored-by: Isaac
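
A reviewer who wants to list the major-version jumps mechanically could use something like the following Python sketch. The helper names are hypothetical, and the parsing assumes plain `vN` or `vN.x.y` tags; subpath-prefixed tags such as acceptance/v0.4.4 are deliberately not handled here.

```python
import re


def major_version(tag):
    """Extract the leading major number from tags such as 'v5' or 'v6.0.2'."""
    match = re.match(r"v?(\d+)", tag)
    return int(match.group(1)) if match else None


def is_major_bump(previous_tag, pinned_tag):
    """True when the pinned tag's major version exceeds the previous one."""
    old, new = major_version(previous_tag), major_version(pinned_tag)
    return old is not None and new is not None and new > old
```

With the pairs from the resolution list, `is_major_bump("v5", "v6.0.2")` flags actions/checkout, while `is_major_bump("v4", "v4.0.5")` leaves actions/deploy-pages unflagged.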
…kflows

Databricks Labs Repository Lockdown policy requires any workflow using a
non-exempt secret (anything other than GITHUB_TOKEN or CODECOV_TOKEN) to
run inside a single protected GitHub Environment. GeoBrix uses
REPO_ACCESS_TOKEN (PAT fallback for private-repo checkout) across most
workflows, so every job that calls actions/checkout with that token now
sets `environment: runtime`.

Changes:
- Added `permissions: contents: read` at top level where missing
  (codeql-analysis, publish-maven, release) and removed stray top-level
  `id-token: write` from build_main / build_python / build_scala /
  build_scala_by_package / codecov-scala-parallel / codecov-upload
  (none of those jobs request OIDC tokens).
- deploy-docs: moved `pages: write` and `id-token: write` from top level
  down to the deploy job only (least privilege). The build job keeps
  `environment: runtime` for its REPO_ACCESS_TOKEN checkout; the deploy
  job keeps its existing `environment: github-pages`.
- doc-tests: added `environment: runtime` on all three (currently
  disabled) jobs that perform REPO_ACCESS_TOKEN checkouts, so they are
  compliant when re-enabled.
- release.yml: changed `environment: release` -> `environment: runtime`
  to converge on the single protected env the policy expects.
- release.yml + publish-maven.yml: DISABLED via `if: false` on their
  publish jobs with a banner comment explaining the policy context and
  how to re-enable. GeoBrix is not publishing to PyPI or GitHub Packages
  from Actions today; we will coordinate with Labs before re-enabling.

Exempt secrets per policy (GITHUB_TOKEN, CODECOV_TOKEN) are untouched
and do not require the protected environment.

Co-authored-by: Isaac
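
The environment rule in this commit can be checked with a small Python sketch, assuming the workflow file has already been parsed into a plain dictionary (for example by a YAML loader, which is not shown). The function name and the job names in the usage note are hypothetical; the secret and environment names come from the commit message.

```python
import json

NON_EXEMPT_SECRET = "secrets.REPO_ACCESS_TOKEN"
PROTECTED_ENVIRONMENT = "runtime"


def jobs_missing_environment(workflow):
    """Return names of jobs that reference the secret without `environment: runtime`."""
    missing = []
    for name, job in workflow.get("jobs", {}).items():
        # Serialize the whole job so the secret is found wherever it is referenced.
        uses_secret = NON_EXEMPT_SECRET in json.dumps(job)
        environment = job.get("environment")
        # `environment:` may be a plain string or a mapping with a `name` key.
        if isinstance(environment, dict):
            environment = environment.get("name")
        if uses_secret and environment != PROTECTED_ENVIRONMENT:
            missing.append(name)
    return missing
```

A job that only uses an exempt secret such as CODECOV_TOKEN is never reported, which matches the exemption described above.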
Labs Repository Lockdown policy: every Dependabot ecosystem in the repo
must apply a cooldown so we are not the first adopters of a just-released
(possibly compromised) version. Applied `cooldown.default-days: 7` to both
maven and pip ecosystems.

The policy also excludes `github-actions` from Dependabot entirely — action
SHAs are refreshed manually via scripts/security/pin-gh-actions so bumps
are reviewed as part of the security workflow rather than as auto-opened
PRs. Added a comment documenting the intentional absence.

Co-authored-by: Isaac
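
The two Dependabot rules (a cooldown on every ecosystem, and no github-actions entry) can be expressed as a Python check over the parsed dependabot.yml. This is a sketch under the assumption that the file has been loaded into a dictionary; the function name and message wording are hypothetical.

```python
def cooldown_problems(config, minimum_days=7):
    """Return human-readable problems found in a parsed Dependabot configuration."""
    problems = []
    for update in config.get("updates", []):
        ecosystem = update.get("package-ecosystem")
        if ecosystem == "github-actions":
            # Action SHAs are refreshed manually via scripts/security/pin-gh-actions.
            problems.append("github-actions should be absent")
            continue
        days = update.get("cooldown", {}).get("default-days", 0)
        if days < minimum_days:
            problems.append(
                f"{ecosystem}: cooldown.default-days is {days}, expected at least {minimum_days}"
            )
    return problems
```

A configuration with maven and pip entries that each set `cooldown.default-days: 7` yields an empty list.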
Databricks Labs Repository Lockdown policy requires all build-time binary
fetches to be integrity-verified and all base images to be pinned by
digest so a compromised registry/mirror cannot silently swap bytes.

Dockerfile changes:
- Pinned `FROM ubuntu:24.04` to the multi-arch manifest-list digest
  `sha256:c4a8d5503dfb2a3eb8ab5f807da5bc69a85730fb49b5cfca2330194ebcc41c7b`
  (kept `# ubuntu:24.04` comment for human readability).
- Hadoop 3.4.0 tarball: replaced `wget | tar` stream with
  download -> sha512sum -c -> extract, using the official
  HADOOP_SHA512 from downloads.apache.org/.sha512.
- GDAL 3.11.4 tarball: same pattern with a locally-computed SHA-256.
  OSGeo only publishes MD5; we MD5-verified the upstream download
  (9f4fa4b3be48fb60d5dd76fecb11a5f6) then computed and pinned SHA-256.
- Apache Maven 3.9.9: replaced the dynamic `.sha512` fetch (which reads
  the checksum from the same origin as the tarball and therefore provides
  no protection against origin compromise) with an in-Dockerfile pinned
  MAVEN_SHA512 ARG, cross-checked against archive.apache.org.

scripts/util/install_hadoop.sh:
- Not referenced by the build; kept as a manual mirror of the Dockerfile
  flow. Rewrote with `set -euo pipefail`, a pinned HADOOP_SHA512, and
  `sha512sum -c` verification. Made executable.

Each checksum has a matching comment documenting the authoritative source
and the requirement to bump it in lockstep with the underlying version.

Co-authored-by: Isaac
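
The download, verify, extract pattern used for Hadoop, GDAL, and Maven is implemented in the Dockerfile with `sha512sum -c`; the Python sketch below only illustrates the same logic. The function names are hypothetical, and the GDAL helper mirrors the described flow (check the upstream MD5, then compute a SHA-256 to pin) under that assumption.

```python
import hashlib


def file_digest(path, algorithm):
    """Hash a file in chunks so large tarballs are not loaded into memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path, algorithm, expected_hex):
    """Raise ValueError unless the file matches the pinned checksum."""
    actual = file_digest(path, algorithm)
    if actual != expected_hex.lower():
        raise ValueError(f"{algorithm} mismatch for {path}: got {actual}")


def pin_sha256_after_md5(path, upstream_md5):
    """Check the published MD5 first, then return a SHA-256 suitable for pinning."""
    verify_pinned(path, "md5", upstream_md5)
    return file_digest(path, "sha256")
```

The key property is that the expected value lives in the build file rather than being fetched from the same origin as the tarball, so extraction should only proceed after `verify_pinned` returns without raising.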
@mjohns-databricks
Collaborator Author

Notes:

  1. This project was still in progress when the recent policy changes were imposed, so it still needs finalization from your team before official launch. For example, CODECOV_TOKEN has never been properly configured.
  2. release.yml and publish-maven.yml disabled via if: false with banner comments and re-enable instructions (we are not publishing to PyPI / GitHub Packages from Actions today). Do you want these fully deleted?


@gueniai left a comment

  1. Is it possible to also lock PDAL to a SHA?

  2. The PR adds environment: runtime to the three disabled jobs (test-python-docs, validate-structure, test-scala-docs) — but those jobs still contain:

ref: ${{ github.event_name == 'workflow_run' && github.event.workflow_run.head_sha || github.sha }}

When triggered via workflow_run from a fork PR, github.event.workflow_run.head_sha is the fork's commit. When someone removes the if: false to
re-enable these jobs, they will check out attacker-controlled code in a job that now has REPO_ACCESS_TOKEN available — because this PR added the
environment gate.

Before this PR: disabled jobs had no environment, so re-enabling would give you fork code + no REPO_ACCESS_TOKEN.
After this PR: disabled jobs have environment: runtime, so re-enabling gives you fork code + REPO_ACCESS_TOKEN.

The environment: runtime addition is correct in principle, but without the origin guard it makes re-enablement more dangerous, not less. The guard that
must be added before removing if: false:

if: github.event.workflow_run.head_repository.full_name == github.repository

Consider adding this as a comment block (or as a disabled if: condition) directly in the file now, so whoever re-enables the jobs can't miss it.
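
The effect of the proposed guard can be modeled in Python to make the fork case explicit. This is only a model of the expression `github.event.workflow_run.head_repository.full_name == github.repository`, not GitHub's evaluator; the function name and the repository names in the usage note are hypothetical.

```python
def origin_guard(event, repository):
    """Return True only when the triggering workflow run's head repo is this repo.

    `event` stands in for the `github.event` payload and `repository` for
    `github.repository`. Missing keys are treated as a non-match.
    """
    head_repo = (event.get("workflow_run") or {}).get("head_repository") or {}
    return head_repo.get("full_name") == repository
```

With `repository = "example-org/example-repo"`, an event whose head repository is `someone/example-repo` (a fork) returns False, so the job holding REPO_ACCESS_TOKEN would be skipped. In this model an event without a `workflow_run` payload also returns False, so a job with other triggers would likely need a broader condition.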

… jobs

Addresses review feedback on PR #19:

1. Pin PDAL 2.8.2 to commit SHA 736fa0a66af4bed7105dff5fa152edf26bbb8a3a.
   Tags are mutable; switch the pdal-builder stage from `git clone -b <tag>`
   to `git fetch --depth 1 origin <SHA>` + `git checkout FETCH_HEAD`.
   New ARG PDAL_SHA is documented alongside PDAL_VERSION with the bump
   procedure, matching the Hadoop/GDAL/Maven pattern.

2. Add a SECURITY banner above the `if: false` line on each disabled job
   in doc-tests.yml (test-python-docs, test-scala-docs, validate-structure).
   These jobs now bind environment: runtime (which scopes REPO_ACCESS_TOKEN);
   combined with the workflow_run trigger and head_sha checkout used by two
   of the three jobs, naively re-enabling would expose REPO_ACCESS_TOKEN to
   fork-controlled code. Banner prescribes the required origin guard:
     if: github.event.workflow_run.head_repository.full_name == github.repository
   Banner also added to test-scala-docs since copy-paste from siblings is
   the likely re-enable path.

Co-authored-by: Isaac
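
The switch from cloning a tag to fetching a commit can be sketched as a Python helper that builds the git command sequence. The `git fetch --depth 1 origin <SHA>` and `git checkout FETCH_HEAD` steps come from the commit message; the preceding `git init` and `git remote add` steps, the function name, and the remote URL passed by the caller are assumptions about how the pdal-builder stage might set up the working tree.

```python
import re

FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def pinned_fetch_commands(remote_url, commit_sha):
    """Build the commands that check out exactly one pinned commit.

    A tag name is rejected because tags are mutable; only a full
    40-character lowercase hex SHA is accepted.
    """
    if FULL_SHA_RE.fullmatch(commit_sha) is None:
        raise ValueError("expected a full 40-character lowercase hex commit SHA")
    return [
        ["git", "init"],
        ["git", "remote", "add", "origin", remote_url],
        ["git", "fetch", "--depth", "1", "origin", commit_sha],
        ["git", "checkout", "FETCH_HEAD"],
    ]
```

Each inner list can be passed to `subprocess.run(..., check=True)` inside an empty directory. Passing `"2.8.2"` instead of the SHA raises ValueError, which is the point of the pin.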
@mjohns-databricks
Collaborator Author

Addressed in 51f428b.

1. PDAL pinned to commit SHA. Resolved tag 2.8.2 → 736fa0a66af4bed7105dff5fa152edf26bbb8a3a via the GitHub API and switched the pdal-builder stage from git clone --depth 1 -b <tag> to git fetch --depth 1 origin <SHA> + git checkout FETCH_HEAD. New ARG PDAL_SHA sits next to ARG PDAL_VERSION with the same bump documentation pattern as Hadoop / GDAL / Maven.

2. Origin-guard banner added to all three disabled jobs in doc-tests.yml. I left if: false in place (rather than swapping in the disabled-if: form) since if: false is the unambiguous "this is off" signal — and the banner now sits directly above it with the exact replacement spelled out:

if: github.event.workflow_run.head_repository.full_name == github.repository

I added the banner to test-scala-docs too, even though it doesn't currently set an explicit ref: — the sibling jobs do, and copy-paste from them is the likely re-enable path. The banner explains the environment: runtime + workflow_run + fork PR exposure path so a future re-enabler can't miss it.


@gueniai left a comment

LGTM

Required by the Databricks Labs Repository Lockdown policy. Combined with
branch protection, the listed team must approve every PR before merge.

Pattern matches sibling labs repos (ucx, blueprint, dqx): root-level
CODEOWNERS with a single `*` rule pointing to the per-repo write team.

Co-authored-by: Isaac
@mjohns-databricks
Collaborator Author

Added CODEOWNERS in b649702.

Mirrors the sibling labs pattern (ucx, blueprint, dqx): root-level CODEOWNERS with a single * rule pointing to the per-repo write team. The databrickslabs/geobrix-write team already exists and is the natural target — once branch protection is configured to require code-owner review, every PR will need approval from a member of that team before merge.
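
A quick way to confirm that a CODEOWNERS file follows the single `*` rule pattern is sketched below in Python. The parser is deliberately minimal and hypothetical: it ignores comments and blank lines and only looks for the catch-all pattern. The `@databrickslabs/geobrix-write` handle in the usage note assumes the usual `@org/team` spelling for the team named above.

```python
def catch_all_owners(codeowners_text):
    """Return the owners on the last `*` rule, or an empty list if there is none."""
    owners = []
    for raw in codeowners_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        pattern, *rule_owners = line.split()
        if pattern == "*":
            owners = rule_owners  # the last matching rule takes precedence
    return owners
```

For a file containing the single line `* @databrickslabs/geobrix-write`, the function returns `["@databrickslabs/geobrix-write"]`.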

@gueniai added this pull request to the merge queue on Apr 30, 2026
Merged via the queue into main with commit ab812cb Apr 30, 2026
1 check passed